fix(alerts): Don't group on time #89447
```python
    referrer_suffix=referrer_suffix,
)
result: Mapping[int, int] = {}
if tsdb_function == tsdb.get_sums:
```
I didn't want to add `group_on_time` to the other tsdb functions we may pass, since they don't use it anyway, but this feels a little hacky.
One option could be to use `functools.partial(tsdb.get_sums, group_on_time=True)` when you pass it, instead of passing it through here. I think either that or adding `group_on_time` as a param to all funcs is the right move.
You'll need to do this for `get_distinct_counts_totals` too.
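A minimal sketch of the `functools.partial` suggestion. The `get_sums` body and signature below are stand-ins for illustration, not Sentry's actual implementation; the point is that the caller pre-binds `group_on_time` instead of branching on the function's identity at the call site:

```python
from functools import partial


def get_sums(model, keys, start, end, group_on_time=False):
    """Stand-in for tsdb.get_sums: returns per-timestamp buckets when
    group_on_time is True, otherwise a plain {key: total} mapping."""
    totals = {key: 10 for key in keys}
    if group_on_time:
        return {key: {start: total} for key, total in totals.items()}
    return totals


def query(tsdb_function, model, keys, start, end):
    # The shared query path no longer needs to know about group_on_time;
    # it is pre-bound by the partial (or left at its default of False).
    return tsdb_function(model, keys, start, end)


# Pre-bind the keyword instead of checking `tsdb_function == tsdb.get_sums`.
timeseries_sums = partial(get_sums, group_on_time=True)

plain = query(get_sums, "events", [1, 2], 0, 60)
bucketed = query(timeseries_sums, "events", [1, 2], 0, 60)
```

The trade-off: the partial keeps the shared code path free of per-function special cases, at the cost of the binding happening at the (possibly distant) call site.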
```python
    group_1_id: 5,
    group_2_id: 5,
```
Why are the counts 5 here, when we created only 4 events per group?
Oh, this is because the base setup creates events with the same fingerprint. I'll update the fingerprints to be different.
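A minimal sketch of why shared fingerprints skew the test counts (hypothetical grouping logic, not Sentry's actual event pipeline): events with the same fingerprint collapse into one group, so a group's count picks up events intended for another group.

```python
from collections import defaultdict


def group_events(events):
    """Toy grouper: events sharing a fingerprint land in the same group."""
    groups = defaultdict(int)
    for event in events:
        groups[tuple(event["fingerprint"])] += 1
    return dict(groups)


# All 8 events share one fingerprint -> a single group with count 8.
same = [{"fingerprint": ["shared"]} for _ in range(8)]
assert group_events(same) == {("shared",): 8}

# Distinct fingerprints -> two groups with 4 events each, as the
# test presumably intended.
distinct = [{"fingerprint": [f"group-{i % 2}"]} for i in range(8)]
assert group_events(distinct) == {("group-0",): 4, ("group-1",): 4}
```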
Codecov Report: All modified and coverable lines are covered by tests ✅ All tests successful; no failed tests found.

```
@@            Coverage Diff            @@
##           master   #89447    +/-   ##
=======================================
  Coverage   87.71%   87.71%
=======================================
  Files       10180    10180
  Lines      574508   574541    +33
  Branches    22632    22632
=======================================
+ Hits       503932   503969    +37
+ Misses      70148    70144     -4
  Partials      428      428
```
Force-pushed from d21577b to 53ca331.
lgtm, let's merge it when we can be around to watch the deploy though
Issue alerts with "slow" conditions over a large time period (1 day, 1 week, 1 month) can hit the 10,000-result limit when there are a lot of groups to process, because we return timestamped data per group (e.g. group 1 has data for 12:01, 12:02, 12:03, and so on, multiplied by the number of groups we're querying). Data past the limit is dropped, which results in lower counts than expected, so alerts can fail to fire because the data we receive isn't accurate.

To fix this we disable `group_on_time`: instead of receiving all the timestamped data and adding it up later, we simply receive the sum without the timestamps. I renamed `get_sums` to `get_timeseries_sums` and moved the usages that rely on the timeseries data over to it. The places that use `get_sums` now expect the aggregate data in a simple dictionary: `{group_id: count}`.
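The failure mode described above can be sketched as follows (the row limit and counts here are illustrative stand-ins, not Sentry's actual query path): with `group_on_time` the backend returns one row per (group, timestamp) pair and silently truncates past a row limit, while the aggregated form returns one row per group and stays well under it.

```python
ROW_LIMIT = 6  # stand-in for the real 10,000-result limit


def timeseries_rows(groups, buckets, per_bucket):
    """One row per (group, timestamp); rows past the limit are dropped."""
    rows = [(g, t, per_bucket) for g in groups for t in range(buckets)]
    return rows[:ROW_LIMIT]


def sums_from_timeseries(rows):
    """The old approach: add up the surviving timestamped rows."""
    result = {}
    for group_id, _ts, count in rows:
        result[group_id] = result.get(group_id, 0) + count
    return result


def direct_sums(groups, buckets, per_bucket):
    """The new approach: one pre-aggregated row per group."""
    return {g: buckets * per_bucket for g in groups}


groups = [1, 2]
# 2 groups x 5 buckets = 10 rows, but only 6 survive the limit,
# so group 2 is undercounted and could miss an alert threshold.
truncated = sums_from_timeseries(timeseries_rows(groups, 5, 1))
accurate = direct_sums(groups, 5, 1)
```

Here `truncated` reports group 2 with a count of 1 instead of 5, which is exactly the undercounting that kept alerts from firing.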
Suspect IssuesThis pull request was deployed and Sentry observed the following issues: